This notebook explores recent data from the NYC Open Data data set on NYPD motor vehicle collisions.
The dataset can also be reached and queried through its Google BigQuery location.
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council who will use it to inform new legislation and/or projects.
Libraries used during the exploration:
library(magrittr)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
Sys.setenv('MAPBOX_TOKEN' =
'pk.eyJ1IjoiaXJqZXJhZCIsImEiOiJjajA4cWNkajkwMjRrMnFvNnlwMGFhZmM5In0.5PGU5SV2qdyix9tSEkjMgg')
Load the data from the data/ directory into memory:
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
str(dt)
'data.frame': 990800 obs. of 29 variables:
$ DATE : Factor w/ 1709 levels "01/01/2013","01/01/2014",..: 200 200 200 200 1068 920 611 200 65 65 ...
$ TIME : Factor w/ 1440 levels "0:00","0:01",..: 556 631 632 644 661 936 46 686 1051 1066 ...
$ BOROUGH : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 1 2 2 3 1 1 1 1 3 3 ...
$ ZIP.CODE : int NA 10454 10466 11218 NA NA NA NA 11218 11236 ...
$ LATITUDE : num 40.7 40.8 40.9 40.6 40.7 ...
$ LONGITUDE : num -73.9 -73.9 -73.9 -74 -73.9 ...
$ LOCATION : Factor w/ 90272 levels "","(0.0, 0.0)",..: 28545 73165 89328 17808 52600 1 1 14824 17763 18379 ...
$ ON.STREET.NAME : Factor w/ 9151 levels "","?EST 125 STREET",..: 1420 1 3493 1 1 1 6598 4069 738 7071 ...
$ CROSS.STREET.NAME : Factor w/ 9585 levels "","0","01247",..: 1 1 9364 1 1 1 7399 3638 114 4201 ...
$ OFF.STREET.NAME : Factor w/ 59908 levels "","(26 BROOKLYN TERMINAL MARKET LOT)",..: 1 38225 1 29898 1 1 1 1 1 1 ...
$ NUMBER.OF.PERSONS.INJURED : int 0 0 1 0 0 0 0 0 1 2 ...
$ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 1 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.MOTORIST.INJURED : int 0 0 0 0 0 0 0 0 1 2 ...
$ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "","Accelerator Defective",..: 10 47 47 47 47 11 1 43 43 43 ...
$ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 49 levels "","Accelerator Defective",..: 47 1 1 47 47 47 1 47 47 47 ...
$ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 42 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 31 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
$ UNIQUE.KEY : int 3612721 3612791 3618743 3614471 3284922 2833714 336679 3618925 3598095 3597360 ...
$ VEHICLE.TYPE.CODE.1 : Factor w/ 18 levels "","AMBULANCE",..: 15 10 12 15 10 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.2 : Factor w/ 18 levels "","AMBULANCE",..: 10 1 1 10 16 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.3 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 15 1 ...
$ VEHICLE.TYPE.CODE.4 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ VEHICLE.TYPE.CODE.5 : Factor w/ 16 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
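The str() output shows DATE and TIME stored as factors, which makes grouping by year or hour awkward. Below is a minimal sketch of converting them, assuming the MM/DD/YYYY and H:MM formats visible in the factor levels above. The toy data frame and the DATE.PARSED, YEAR, and HOUR column names are illustrative and are not part of the original notebook.

```r
# Toy rows mimicking the DATE and TIME columns (values are illustrative)
toy <- data.frame(
  DATE = factor(c("01/21/2014", "03/06/2015", "09/30/2016")),
  TIME = factor(c("0:00", "16:00", "8:45"))
)

# Convert the factor to character first, then parse as MM/DD/YYYY
toy$DATE.PARSED <- as.Date(as.character(toy$DATE), format = "%m/%d/%Y")

# Calendar year as an integer, taken from the parsed date
toy$YEAR <- as.integer(format(toy$DATE.PARSED, "%Y"))

# Hour of day: the part of TIME before the colon
toy$HOUR <- as.integer(sub(":.*$", "", as.character(toy$TIME)))

str(toy)
```

The same three assignments could be applied to dt directly once the formats are confirmed on the full file.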
Inspect summary of dataset with summary() command:
summary(dt)
DATE TIME BOROUGH
01/21/2014: 1161 16:00 : 12792 :260725
01/18/2015: 960 15:00 : 12748 BRONX : 95396
02/03/2014: 960 17:00 : 12597 BROOKLYN :223552
03/06/2015: 936 18:00 : 11641 MANHATTAN :187571
01/07/2017: 887 14:00 : 11094 QUEENS :189619
09/30/2016: 872 13:00 : 10365 STATEN ISLAND: 33937
(Other) :985024 (Other):919563
ZIP.CODE LATITUDE LONGITUDE
Min. :10000 Min. : 0.00 Min. :-201.36
1st Qu.:10075 1st Qu.:40.67 1st Qu.: -73.98
Median :11205 Median :40.72 Median : -73.93
Mean :10808 Mean :40.72 Mean : -73.92
3rd Qu.:11236 3rd Qu.:40.77 3rd Qu.: -73.87
Max. :11697 Max. :40.91 Max. : 0.00
NA's :260826 NA's :201443 NA's :201443
LOCATION ON.STREET.NAME
:201443 :188246
(40.6960346, -73.9845292): 673 BROADWAY : 10832
(40.7606005, -73.9643142): 544 ATLANTIC AVENUE : 9354
(40.7572323, -73.9897922): 485 NORTHERN BOULEVARD: 7490
(40.6757357, -73.8968533): 480 3 AVENUE : 6864
(40.6585778, -73.8906229): 464 FLATBUSH AVENUE : 6500
(Other) :786711 (Other) :761514
CROSS.STREET.NAME OFF.STREET.NAME
:217648 :916764
3 AVENUE: 11407 PARKING LOT 110-00 ROCKAWAY BOULEVARD : 150
BROADWAY: 11088 PARKING LOT-772 EDGEWATER RD : 91
2 AVENUE: 9678 PARKING LOT OF 110-00 ROCKAWAY BOULEVARD: 90
5 AVENUE: 7846 3 AVENUE : 72
7 AVENUE: 7312 2 AVENUE : 67
(Other) :725821 (Other) : 73566
NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
Min. : 0.0000 Min. :0.000000
1st Qu.: 0.0000 1st Qu.:0.000000
Median : 0.0000 Median :0.000000
Mean : 0.2552 Mean :0.001214
3rd Qu.: 0.0000 3rd Qu.:0.000000
Max. :43.0000 Max. :5.000000
NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
Min. : 0.00000 Min. :0.0000000
1st Qu.: 0.00000 1st Qu.:0.0000000
Median : 0.00000 Median :0.0000000
Mean : 0.05455 Mean :0.0006833
3rd Qu.: 0.00000 3rd Qu.:0.0000000
Max. :15.00000 Max. :2.0000000
NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
Min. :0.00000 Min. :0.00e+00
1st Qu.:0.00000 1st Qu.:0.00e+00
Median :0.00000 Median :0.00e+00
Mean :0.02093 Mean :7.47e-05
3rd Qu.:0.00000 3rd Qu.:0.00e+00
Max. :6.00000 Max. :1.00e+00
NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
Min. : 0.0000 Min. :0.000000
1st Qu.: 0.0000 1st Qu.:0.000000
Median : 0.0000 Median :0.000000
Mean : 0.1927 Mean :0.000463
3rd Qu.: 0.0000 3rd Qu.:0.000000
Max. :43.0000 Max. :5.000000
CONTRIBUTING.FACTOR.VEHICLE.1
Unspecified :523736
Driver Inattention/Distraction:127688
Fatigued/Drowsy : 48249
Failure to Yield Right-of-Way : 42948
Other Vehicular : 30393
Backing Unsafely : 27886
(Other) :189900
CONTRIBUTING.FACTOR.VEHICLE.2
Unspecified :738985
:123724
Driver Inattention/Distraction: 37843
Other Vehicular : 17711
Fatigued/Drowsy : 13016
Failure to Yield Right-of-Way : 9087
(Other) : 50434
CONTRIBUTING.FACTOR.VEHICLE.3
:925696
Unspecified : 59537
Other Vehicular : 1225
Fatigued/Drowsy : 1122
Driver Inattention/Distraction: 1100
Pavement Slippery : 234
(Other) : 1886
CONTRIBUTING.FACTOR.VEHICLE.4
:976717
Unspecified : 12938
Fatigued/Drowsy : 222
Other Vehicular : 221
Driver Inattention/Distraction: 192
Pavement Slippery : 67
(Other) : 443
CONTRIBUTING.FACTOR.VEHICLE.5 UNIQUE.KEY
:987360 Min. : 22
Unspecified : 3186 1st Qu.: 249509
Other Vehicular : 52 Median :3131520
Fatigued/Drowsy : 48 Mean :2054070
Driver Inattention/Distraction: 36 3rd Qu.:3379220
Pavement Slippery : 23 Max. :3627969
(Other) : 95
VEHICLE.TYPE.CODE.1
PASSENGER VEHICLE :579372
SPORT UTILITY / STATION WAGON:218537
TAXI : 37190
VAN : 26511
OTHER : 24699
UNKNOWN : 20713
(Other) : 83778
VEHICLE.TYPE.CODE.2
PASSENGER VEHICLE :438701
SPORT UTILITY / STATION WAGON:165455
:134997
UNKNOWN : 80864
TAXI : 31205
OTHER : 25249
(Other) :114329
VEHICLE.TYPE.CODE.3
:926878
PASSENGER VEHICLE : 38181
SPORT UTILITY / STATION WAGON: 15761
UNKNOWN : 3240
VAN : 1401
TAXI : 1163
(Other) : 4176
VEHICLE.TYPE.CODE.4
:977085
PASSENGER VEHICLE : 8441
SPORT UTILITY / STATION WAGON: 3553
UNKNOWN : 583
VAN : 248
OTHER : 205
(Other) : 685
VEHICLE.TYPE.CODE.5
:987436
PASSENGER VEHICLE : 2072
SPORT UTILITY / STATION WAGON: 958
UNKNOWN : 94
OTHER : 52
VAN : 50
(Other) : 138
The dataset structure revealed the variables and their classes, listed here with sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]]))):
DATE is class: factor,
TIME is class: factor,
BOROUGH is class: factor,
ZIP.CODE is class: integer,
LATITUDE is class: numeric,
LONGITUDE is class: numeric,
LOCATION is class: factor,
ON.STREET.NAME is class: factor,
CROSS.STREET.NAME is class: factor,
OFF.STREET.NAME is class: factor,
NUMBER.OF.PERSONS.INJURED is class: integer,
NUMBER.OF.PERSONS.KILLED is class: integer,
NUMBER.OF.PEDESTRIANS.INJURED is class: integer,
NUMBER.OF.PEDESTRIANS.KILLED is class: integer,
NUMBER.OF.CYCLIST.INJURED is class: integer,
NUMBER.OF.CYCLIST.KILLED is class: integer,
NUMBER.OF.MOTORIST.INJURED is class: integer,
NUMBER.OF.MOTORIST.KILLED is class: integer,
CONTRIBUTING.FACTOR.VEHICLE.1 is class: factor,
CONTRIBUTING.FACTOR.VEHICLE.2 is class: factor,
CONTRIBUTING.FACTOR.VEHICLE.3 is class: factor,
CONTRIBUTING.FACTOR.VEHICLE.4 is class: factor,
CONTRIBUTING.FACTOR.VEHICLE.5 is class: factor,
UNIQUE.KEY is class: integer,
VEHICLE.TYPE.CODE.1 is class: factor,
VEHICLE.TYPE.CODE.2 is class: factor,
VEHICLE.TYPE.CODE.3 is class: factor,
VEHICLE.TYPE.CODE.4 is class: factor,
VEHICLE.TYPE.CODE.5 is class: factor
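A listing like this can also be kept as a named character vector, which is easier to tabulate or filter later. This is a small sketch on a toy data frame; the toy columns and the name classes are illustrative assumptions, not objects from the original notebook.

```r
# Toy data frame with one column of each class seen in the listing
toy <- data.frame(
  BOROUGH  = factor(c("BRONX", "BROOKLYN")),
  ZIP.CODE = c(10454L, 11218L),
  LATITUDE = c(40.8, 40.6)
)

# First class of each column, returned as a named character vector
classes <- vapply(toy, function(col) class(col)[1], character(1))
classes

# Count how many columns fall into each class
table(classes)
```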
Relabel unmarked boroughs from the empty string "" to “BOROUGH NA”:
levels(dt$BOROUGH)[levels(dt$BOROUGH) == ""] <- "BOROUGH NA"
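Renaming a factor level in place changes the label while leaving the position of that level unchanged. Here is a small sketch on a toy factor (the values and the names borough and brooklyn_rows are illustrative), which also shows how the Brooklyn rows could then be isolated:

```r
# Toy factor with blank entries, mimicking the BOROUGH column
borough <- factor(c("", "BROOKLYN", "BRONX", "", "BROOKLYN"))

# Replace only the empty-string level; other levels keep their labels
levels(borough)[levels(borough) == ""] <- "BOROUGH NA"

# Counts per level after the relabel
table(borough)

# Logical index of the Brooklyn rows
brooklyn_rows <- borough == "BROOKLYN"
sum(brooklyn_rows)
```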
With latitude and longitude present for most records (201,443 of 990,800 are NA), let’s take a quick look at these accidents on an interactive world map, in case erroneous points lie somewhere outside New York. We will split the points by the BOROUGH factor. This shows the geographic extent of each borough and gives early insight into anything specific to our point of interest, BOROUGH == "BROOKLYN".
mp <- dt %>% plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE, split = ~BOROUGH, mode = 'markers')
plotly_build(mp)
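The summary() output shows a minimum LATITUDE of 0.00 and LONGITUDE values ranging from -201.36 to 0.00, so some points will land far from New York on the map. The following is a minimal sketch of flagging them, assuming a rough bounding box around the city. The box limits, the toy rows, and the names in_box and toy_clean are assumptions chosen for illustration, not values from the original notebook.

```r
# Toy rows: two plausible NYC points, a (0, 0) point, an impossible
# longitude, and a record with missing coordinates
toy <- data.frame(
  LATITUDE  = c(40.70, 0.00, 40.65, NA, 40.72),
  LONGITUDE = c(-73.95, 0.00, -201.36, NA, -73.93)
)

# Approximate limits around New York City (assumed, adjust as needed)
lat_range <- c(40.4, 41.0)
lon_range <- c(-74.3, -73.6)

# TRUE only for rows with non-missing coordinates inside the box
in_box <- !is.na(toy$LATITUDE) & !is.na(toy$LONGITUDE) &
  toy$LATITUDE  >= lat_range[1] & toy$LATITUDE  <= lat_range[2] &
  toy$LONGITUDE >= lon_range[1] & toy$LONGITUDE <= lon_range[2]

# Keep the plausible rows; the rest can be inspected separately
toy_clean <- toy[in_box, ]
table(in_box)
```

Keeping the flag as a separate logical vector, instead of dropping rows immediately, leaves the rejected records available for inspection.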